In this week’s lab, the main goal is to learn and practice how to wrangle a data set. On the due date, turn in your Rmd file and the html product.
Open your project for this class. Make sure all your work is done relative to this project.
Open the lab3.Rmd file provided with the instructions. You can edit this file and add your answers to questions in this document.
This is about the french fries example from the lecture notes. Make sure the code works for you, and answer these questions:
Some subjects rate the chips lower on potato'y over time, regardless of oil, but for most subjects there is no trend.
It looks like the rating on grassy goes down, particularly for weeks 7-9, but it pops up again in week 10. It's hard to say there is a trend.
data(french_fries, package = "reshape2")
french_fries <- as_tibble(french_fries)
ff_m <- french_fries %>%
gather(type, rating, -subject, -time, -treatment, -rep)
ff_av <- ff_m %>%
filter(type == "potato") %>%
group_by(subject, time, treatment) %>%
summarise(rating=mean(rating))
ggplot(ff_av, aes(x=time, y=rating, colour=treatment, group=treatment)) +
geom_line() +
facet_wrap(~subject)
ff_grassy <- ff_m %>%
filter(type == "grassy")
ggplot(ff_grassy, aes(x=treatment, y=rating)) +
geom_boxplot() + scale_y_sqrt() +
facet_wrap(~time)
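As an aside, `gather()` has since been superseded in tidyr; the same reshaping can be written with `pivot_longer()`. A minimal sketch of the equivalent call (assuming the same `french_fries` data):

```r
# Hedged sketch: pivot_longer() is the newer tidyr replacement for gather()
library(dplyr)
library(tidyr)

data(french_fries, package = "reshape2")
ff_m <- french_fries %>%
  as_tibble() %>%
  pivot_longer(cols = c(-subject, -time, -treatment, -rep),
               names_to = "type", values_to = "rating")
```

The rest of the pipeline (filter, group_by, summarise) works unchanged on the result.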
For the airlines data shown in the lecture notes, join the airport locations with the flight table, and answer these questions:
# A tibble: 146 x 4
ORIGIN ORIGIN_LATITUDE ORIGIN_LONGITUDE
<chr> <dbl> <dbl>
1 CVG 39.04889 -84.66778
2 DFW 32.89722 -97.03778
3 DFW 32.89722 -97.03778
4 STL 38.74861 -90.37000
5 IND 39.71722 -86.29472
6 CHS 32.89861 -80.04056
7 DFW 32.89722 -97.03778
8 DFW 32.89722 -97.03778
9 MKE 42.94694 -87.89694
10 MKE 42.94694 -87.89694
# ... with 136 more rows, and 1 more variables: DISPLAY_AIRPORT_NAME <chr>
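A joined table like the one above can be produced with a left join. A hedged sketch, where the flight table name `N4YRAA` and the airports table columns (`AIRPORT`, `LATITUDE`, `LONGITUDE`) are assumptions based on the output, not taken from the lab code:

```r
library(dplyr)

# Hedged sketch: attach origin airport locations to each flight.
# Table and column names are assumptions inferred from the printed output.
N4YRAA_latlon <- N4YRAA %>%
  left_join(airports, by = c("ORIGIN" = "AIRPORT")) %>%
  rename(ORIGIN_LATITUDE = LATITUDE,
         ORIGIN_LONGITUDE = LONGITUDE)
```

A left join keeps every flight even if an airport is missing from the locations table, which is usually what you want here.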
Twice.
N4YRAA_latlon %>%
filter(FL_DATE == ymd("2017-05-06")) %>%
select(ORIGIN, DEST)
# A tibble: 3 x 2
ORIGIN DEST
<chr> <chr>
1 DAY DFW
2 DFW DSM
3 DSM DFW
Three times.
N4YRAA_latlon %>%
filter(FL_DATE == ymd("2017-05-07")) %>%
select(ORIGIN, DEST)
# A tibble: 6 x 2
ORIGIN DEST
<chr> <chr>
1 DFW MFE
2 MFE DFW
3 DFW IAH
4 IAH DFW
5 DFW MSY
6 MSY DFW
N4YRAA_latlon %>% count(ORIGIN) %>% filter(ORIGIN == "DFW")
# A tibble: 1 x 2
ORIGIN n
<chr> <int>
1 DFW 57
No; we can see just by looking at Mondays that 1/5 and 8/5 have different routes.
N4YRAA_latlon %>% mutate(wday=wday(FL_DATE, label=TRUE)) %>%
filter(wday == "Mon")
# A tibble: 20 x 15
FL_DATE CARRIER FL_NUM ORIGIN DEST DEP_TIME ARR_TIME DISTANCE
<date> <chr> <int> <chr> <chr> <chr> <chr> <dbl>
1 2017-05-01 AA 2663 DAY DFW 0558 0739 861
2 2017-05-01 AA 970 DFW MCI 0852 1027 460
3 2017-05-01 AA 970 MCI DFW 1109 1256 460
4 2017-05-01 AA 2358 DFW MSY 1350 1507 447
5 2017-05-01 AA 2358 MSY DFW 1551 1720 447
6 2017-05-01 AA 2338 DFW CHS 1904 2225 987
7 2017-05-08 AA 137 DFW PIT 0657 1032 1067
8 2017-05-08 AA 188 PIT DFW 1119 1301 1067
9 2017-05-08 AA 2458 DFW PIT 1425 1754 1067
10 2017-05-08 AA 2458 PIT DFW 1852 2038 1067
11 2017-05-08 AA 2176 DFW OKC 2207 2301 175
12 2017-05-22 AA 1252 ORD RDU 0732 1021 646
13 2017-05-22 AA 1252 RDU ORD 1117 1221 646
14 2017-05-22 AA 312 ORD MSP 1338 1456 334
15 2017-05-22 AA 46 MSP ORD 1600 1734 334
16 2017-05-22 AA 1323 ORD STL 1855 2017 258
17 2017-05-29 AA 1556 DFW IAD 0659 1037 1172
18 2017-05-29 AA 1556 IAD DFW 1133 1354 1172
19 2017-05-29 AA 386 DFW JAX 1517 1830 918
20 2017-05-29 AA 386 JAX DFW 1911 2042 918
# ... with 7 more variables: ORIGIN_LATITUDE <dbl>,
# ORIGIN_LONGITUDE <dbl>, DISPLAY_AIRPORT_NAME.x <chr>,
# DEST_LATITUDE <dbl>, DEST_LONGITUDE <dbl>,
# DISPLAY_AIRPORT_NAME.y <chr>, wday <ord>
N4YRAA_latlon %>% summarise(sum(DISTANCE))
# A tibble: 1 x 1
`sum(DISTANCE)`
<dbl>
1 90089
N4YRAA_latlon %>% summarise(mean(DISTANCE))
# A tibble: 1 x 1
`mean(DISTANCE)`
<dbl>
1 612.8503
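The two summaries above can be computed in a single pass; a minimal sketch:

```r
library(dplyr)

# Total and average distance flown, in one summarise() call
N4YRAA_latlon %>%
  summarise(total_distance = sum(DISTANCE),
            mean_distance  = mean(DISTANCE))
```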
This question extends the analysis of the whaleshark data, which was pulled from https://www.whaleshark.org in 2013. It lists verified encounters with whale sharks across the globe.
whalesharks <- read_csv("whaleshark-encounters.csv")
grampusr@hotmail.com. I think this is Rafael de la Parra, who is very active in whale shark conservation.
whalesharks %>% count(`Submitter Email Address`, sort=TRUE)
# A tibble: 2,749 x 2
`Submitter Email Address` n
<chr> <int>
1 grampusr@hotmail.com 3370
2 wsphotoid@yahoo.com 2236
3 dani.rob@dec.wa.gov.au 1267
4 darcy@whaleshark.org 968
5 brad@whaleshark.org 914
6 simon@marinemegafauna.org 802
7 darcybradley@Gmail.com 575
8 adove@georgiaaquarium.org 525
9 sharkwatcharabia@gmail.com 434
10 whaleshark@dec.wa.gov.au 409
# ... with 2,739 more rows
Overall: April, May, June and July; in the southern hemisphere it is clearly April, May and June.
whalesharks %>% count(`Month Collected`)
# A tibble: 13 x 2
`Month Collected` n
<int> <int>
1 1 853
2 2 911
3 3 1444
4 4 4017
5 5 3823
6 6 3161
7 7 2837
8 8 2305
9 9 1375
10 10 1030
11 11 765
12 12 791
13 NA 52
whalesharks %>% filter(Latitude < 0) %>% count(`Month Collected`)
# A tibble: 13 x 2
`Month Collected` n
<int> <int>
1 1 291
2 2 299
3 3 347
4 4 2241
5 5 2686
6 6 1882
7 7 961
8 8 184
9 9 183
10 10 305
11 11 274
12 12 203
13 NA 6
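To compare the seasonal pattern between hemispheres on a common scale, the counts can be converted to within-hemisphere proportions; a sketch:

```r
library(dplyr)

# Proportion of encounters per month, split by hemisphere
whalesharks %>%
  filter(!is.na(Latitude)) %>%
  mutate(hemisphere = ifelse(Latitude < 0, "south", "north")) %>%
  count(hemisphere, `Month Collected`) %>%
  group_by(hemisphere) %>%
  mutate(p = n / sum(n))
```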
library(lubridate)
ningaloo <- whalesharks %>% filter(grepl("Ningaloo", Locality)) %>%
mutate(date=ymd(paste(`Year Collected`, `Month Collected`,
`Day Collected`, sep="-")))
62 times.
library(ggmap)
map <- get_map(location=c(lon=114.1, lat=-21.9), zoom=8)
ningaloo %>% count(`Marked Individual`, sort=TRUE)
# A tibble: 774 x 2
`Marked Individual` n
<chr> <int>
1 <NA> 754
2 A-524 62
3 A-098 57
4 A-001 55
5 A-102 51
6 A-534 51
7 A-268 47
8 A-425 44
9 A-529 44
10 A-227 43
# ... with 764 more rows
A524 <- ningaloo %>% filter(`Marked Individual`=="A-524")
ggmap(map) +
geom_point(data=A524, aes(x=Longitude, y=Latitude), colour="red") +
geom_line(data=A524, aes(x=Longitude, y=Latitude), colour="red")
ningaloo_nomiss <- ningaloo %>% filter(!is.na(`Marked Individual`))
ningaloo_nomiss %>% count(`Marked Individual`, sort=TRUE) %>%
ggplot(aes(x=n)) + geom_histogram(binwidth=5)
keep <- ningaloo_nomiss %>%
count(`Marked Individual`, sort=TRUE) %>%
filter(n>=40)
ningaloo_freqwhales <- ningaloo %>%
filter(`Marked Individual` %in% keep$`Marked Individual`) %>%
arrange(date)
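The keep/filter pair above can also be written as one pipeline with `add_count()`, which attaches the per-individual count as a column; a sketch:

```r
library(dplyr)

# Same result as the keep/filter approach, in a single pipeline:
# count sightings per individual, keep the frequently sighted ones
ningaloo_freqwhales <- ningaloo %>%
  filter(!is.na(`Marked Individual`)) %>%
  add_count(`Marked Individual`) %>%
  filter(n >= 40) %>%
  arrange(date)
```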
Nope, the individuals are pretty much all in a group. Nope, they hang in the same neighborhoods each year. There seems to be a bit more movement April-June.
ggmap(map) + geom_point(data=ningaloo_freqwhales, aes(x=Longitude, y=Latitude,
colour=`Marked Individual`)) +
geom_line(data=ningaloo_freqwhales,
aes(x=Longitude, y=Latitude,
colour=`Marked Individual`, group=`Marked Individual`)) +
facet_wrap(~`Marked Individual`, ncol=5) + theme_map() +
theme(legend.position="None")
ggmap(map) + geom_point(data=ningaloo_freqwhales, aes(x=Longitude, y=Latitude,
colour=`Marked Individual`)) +
geom_line(data=ningaloo_freqwhales,
aes(x=Longitude, y=Latitude,
colour=`Marked Individual`, group=`Marked Individual`)) +
facet_wrap(~`Year Collected`, ncol=5) + theme_map() +
theme(legend.position="None")
ggmap(map) + geom_point(data=ningaloo_freqwhales,
aes(x=Longitude, y=Latitude,
colour=`Marked Individual`)) +
geom_line(data=ningaloo_freqwhales,
aes(x=Longitude, y=Latitude,
colour=`Marked Individual`, group=`Marked Individual`)) +
facet_wrap(~`Month Collected`, ncol=4) + theme_map() +
theme(legend.position="None")
The file budapest.csv has a subset of web click-through data related to hotel searches for Budapest. Each line in this data corresponds to a summary of a person looking for a hotel on the Expedia web site. These are the questions from last lab; do the wrangling now and answer them.
budapest <- read_csv("budapest.csv")
Bookings (CLICK_THRU_TYP_ID 3406) make up a proportion of 0.00102 of the searches - very few.
budapest %>% count(CLICK_THRU_TYP_ID) %>% mutate(p=n/sum(n))
# A tibble: 4 x 3
CLICK_THRU_TYP_ID n p
<int> <int> <dbl>
1 3402 945 0.01890
2 3404 217 0.00434
3 3406 51 0.00102
4 NA 48787 0.97574
Oh, there are only three days where searches are being done for Budapest! Most searches are on Thursday.
library(lubridate)
budapest %>% mutate(wday = wday(SRCH_DATETM, label=TRUE)) %>%
count(wday)
# A tibble: 3 x 2
wday n
<ord> <int>
1 Thurs 25441
2 Fri 18249
3 Sat 6310
Looks like it is about 2 weeks (median 15 days), but the dates are very messy, with strange differences like -40087 days and 4038 days.
date_diff <- budapest %>%
mutate(SRCH_BEGIN_USE_DATE=ymd(SRCH_BEGIN_USE_DATE),
SRCH_DATETM=as.Date(ymd_hms(SRCH_DATETM))) %>%
select(SRCH_BEGIN_USE_DATE, SRCH_DATETM) %>%
mutate(dif = as.numeric(SRCH_BEGIN_USE_DATE - SRCH_DATETM))
summary(date_diff$dif)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-40087 1 15 -7910 44 4038
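Given the wild extremes, one option is to restrict to plausible gaps before summarising. A sketch; the 0-365 day window is my own assumption, not part of the lab:

```r
library(dplyr)

# Hedged sketch: summarise only plausible search-to-stay gaps.
# The 0-365 day window is an assumption, not from the lab.
date_diff %>%
  filter(dif >= 0, dif <= 365) %>%
  summarise(median_days = median(dif),
            mean_days   = mean(dif))
```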
(1pt) There are a lot of missing values in the data, particularly in the booking variable. If an NA essentially means that the person searching quit the site without making a booking, how would you recode the missing values? Replace them with 0, indicating not booked.
(2pt) If I want to answer the question “Does the existence of a promotion, IS_PROMO_FLAG, tend to result in a higher likelihood of a booking?”, what operations do you need to do on the data? No promotion: 0.000942; with promotion: 0.001135. A promotion appears to help a little: out of 100,000 searches, it would lead to about 20 more bookings.
budapest %>% replace_na(list(CLICK_THRU_TYP_ID=0)) %>%
mutate(booked=ifelse(CLICK_THRU_TYP_ID == 3406, "Y", "N")) %>%
group_by(IS_PROMO_FLAG) %>%
count(booked)
# A tibble: 6 x 3
# Groups: IS_PROMO_FLAG [3]
IS_PROMO_FLAG booked n
<chr> <chr> <int>
1 N N 33934
2 N Y 32
3 Y N 14955
4 Y Y 17
5 <NA> N 1060
6 <NA> Y 2
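From these counts the booking rates can be computed directly, rather than by hand; a sketch:

```r
library(dplyr)
library(tidyr)

# Booking rate by promotion status: mean of a logical is a proportion
budapest %>%
  replace_na(list(CLICK_THRU_TYP_ID = 0)) %>%
  mutate(booked = CLICK_THRU_TYP_ID == 3406) %>%
  group_by(IS_PROMO_FLAG) %>%
  summarise(rate = mean(booked))
```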
This question is about the 2015 PISA results. (We used the 2012 data in lab 1.) The data is downloaded from http://www.oecd.org/pisa/data/2015database/. The SPSS format “Student questionnaire data file (419MB)” is downloaded and processed using this code, to extract results for Australia:
library(haven)
pisa_2015 <- read_sav(file.choose())
pisa_au <- pisa_2015 %>% filter(CNT == "AUS")
save(pisa_au, file="pisa_au.rda")
You don’t need to do this, because the Australia data is extracted and saved for you. Your task is to answer these questions about Australia. At times it may be helpful to examine the data dictionary, which is provided as an excel file (you can also download it directly from the OECD PISA site).
A large amount of pre-processing is done on the data, as performed by this code:
load("pisa_au.rda")
pisa_au <- pisa_au %>% mutate(state=as.character(substr(STRATUM, 4, 5)),
schtype_yr=as.character(substr(STRATUM, 6, 7))) %>%
mutate(state=recode(state, "01"="ACT", "02"="NSW", "03"="VIC",
"04"="QLD", "05"="SA", "06"="WA", "07"="TAS", "08"="NT")) %>%
mutate(schtype_yr=recode(schtype_yr,
"01"="Catholic_Y10", "02"="Catholic_noY10",
"03"="Gov_Y10", "04"="Gov_noY10",
"05"="Ind_Y10", "06"="Ind_noY10",
"07"="Catholic_Y10", "08"="Catholic_noY10",
"09"="Gov_Y10", "10"="Gov_noY10",
"11"="Ind_Y10", "12"="Ind_noY10",
"13"="Catholic_Y10", "14"="Catholic_noY10",
"15"="Gov_Y10", "16"="Gov_noY10",
"17"="Ind_Y10", "18"="Ind_noY10",
"19"="Catholic_Y10", "20"="Catholic_noY10",
"21"="Gov_Y10", "22"="Gov_noY10",
"23"="Ind_Y10", "24"="Ind_noY10",
"25"="Catholic_Y10", "26"="Catholic_noY10",
"27"="Gov_Y10", "28"="Gov_noY10",
"29"="Ind_Y10", "30"="Ind_noY10",
"31"="Catholic_Y10", "32"="Catholic_noY10",
"33"="Gov_Y10", "34"="Gov_noY10",
"35"="Ind_Y10", "36"="Ind_noY10",
"37"="Catholic_Y10", "38"="Catholic_noY10",
"39"="Gov_Y10", "40"="Gov_noY10",
"41"="Ind_Y10", "42"="Ind_noY10",
"43"="Catholic_Y10", "44"="Catholic_noY10",
"45"="Gov_Y10", "46"="Gov_noY10",
"47"="Ind_Y10", "48"="Ind_noY10")) %>%
separate(schtype_yr, c("schtype","yr")) %>%
rename(birthmonth=ST003D02T, birthyr=ST003D03T,
gender=ST004D01T, desk=ST011Q01TA,
room=ST011Q02TA, computer=ST011Q04TA, internet=ST011Q06TA,
solarpanels=ST011D17TA, tvs=ST012Q01TA, cars=ST012Q02TA,
music_instr=ST012Q09NA, books=ST013Q01TA, birthcnt=ST019AQ01T,
mother_birthcnt=ST019BQ01T, father_birthcnt=ST019CQ01T,
test_anxiety=ST118Q01NA, ambitious=ST119Q04NA,
prefer_team=ST082Q01NA, make_friends_easy=ST034Q02TA,
tardy=ST062Q03TA, science_fun=ST094Q01NA, breakfast=ST076Q01NA,
work_pay=ST078Q10NA, sport=ST078Q11NA, internet_use=IC006Q01TA,
install_software=IC015Q02NA,
outhours_study=OUTHOURS, math_time=MMINS, read_time=LMINS,
science_time=SMINS, belong=BELONG,
anxtest=ANXTEST, motivat=MOTIVAT, language=LANGN,
home_edres=HEDRES, home_poss=HOMEPOS, wealth=WEALTH,
stuweight=W_FSTUWT) %>%
mutate(math=(PV1MATH+PV2MATH+PV3MATH+PV4MATH+PV5MATH+
PV6MATH+PV7MATH+PV8MATH+PV9MATH+PV10MATH)/10,
science=(PV1SCIE+PV2SCIE+PV3SCIE+PV4SCIE+PV5SCIE+
PV6SCIE+PV7SCIE+PV8SCIE+PV9SCIE+PV10SCIE)/10,
read=(PV1READ+PV2READ+PV3READ+PV4READ+PV5READ+
PV6READ+PV7READ+PV8READ+PV9READ+PV10READ)/10) %>%
select(state, schtype, yr, birthmonth, birthyr, gender, desk, room,
computer, internet, solarpanels, tvs, cars, music_instr, books,
birthcnt, mother_birthcnt, father_birthcnt, test_anxiety,
ambitious, prefer_team, make_friends_easy, tardy, science_fun,
breakfast, work_pay, sport, internet_use, install_software,
outhours_study, math_time, read_time, science_time, belong,
anxtest, motivat, language, home_edres, home_poss, wealth,
stuweight, math, science, read) %>%
mutate(gender=factor(gender, levels=1:2, labels=c("female", "male"))) %>%
mutate(birthmonth=factor(birthmonth, levels=1:12,
labels=c("jan", "feb", "mar", "apr", "may", "jun", "jul", "aug",
"sep", "oct", "nov", "dec")))
The 4-5th digits code the state, and these are extracted into the state variable. The 6-7th digits code both school type and year tested. This is complicated to untangle, so a new text string was created and then separated into the two variables.
ACT.
pisa_au %>% group_by(state) %>% summarise(m=weighted.mean(math, stuweight))
# A tibble: 8 x 2
state m
<chr> <dbl>
1 ACT 505.4857
2 NSW 494.2584
3 NT 478.0795
4 QLD 486.3467
5 SA 489.3091
6 TAS 468.9917
7 VIC 498.6085
8 WA 503.8274
QLD.
pisa_au %>% group_by(state, gender) %>%
summarise(m=weighted.mean(math, stuweight)) %>%
spread(gender, m) %>%
mutate(dif=female-male)
# A tibble: 8 x 4
# Groups: state [8]
state female male dif
<chr> <dbl> <dbl> <dbl>
1 ACT 502.1651 508.7725 -6.607420
2 NSW 492.0413 496.4775 -4.436208
3 NT 463.6726 492.1481 -28.475479
4 QLD 486.8992 485.8188 1.080436
5 SA 485.8819 492.5379 -6.655965
6 TAS 466.9742 470.8269 -3.852642
7 VIC 491.9853 505.2567 -13.271372
8 WA 501.4735 506.1641 -4.690684
(Use anxtest and a simple regression model to answer this question.) Yes, it appears to reduce scores by about 12 points. It's a very weak model, though, because R-squared is only 2%.
fit <- lm(math~anxtest, weights=stuweight, data=pisa_au)
summary(fit)
Call:
lm(formula = math ~ anxtest, data = pisa_au, weights = stuweight)
Weighted Residuals:
Min 1Q Median 3Q Max
-1645.99 -224.24 -29.83 194.74 1467.37
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 498.6578 0.7291 683.92 <2e-16 ***
anxtest -11.7413 0.7349 -15.98 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 356.3 on 13896 degrees of freedom
(632 observations deleted due to missingness)
Multiple R-squared: 0.01804, Adjusted R-squared: 0.01797
F-statistic: 255.3 on 1 and 13896 DF, p-value: < 2.2e-16
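The weak relationship can also be checked visually by plotting the scores against anxiety with the fitted line overlaid; a sketch:

```r
library(ggplot2)

# Math score against test anxiety, with a linear fit overlaid;
# heavy overplotting is softened with alpha
ggplot(pisa_au, aes(x = anxtest, y = math)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm")
```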
The rename operation gives the original variables more understandable names.